United States Patent [19]
Mahadevan et al.

US005797013A
[11] Patent Number: 5,797,013
[45] Date of Patent: Aug. 18, 1998

[54] INTELLIGENT LOOP UNROLLING 

[75] Inventors: Uma Mahadevan, Sunnyvale; Lacky
Shah, Fremont, both of Calif.

[73] Assignee: Hewlett-Packard Company, Palo Alto,
Calif.

[21] Appl. No.: 564,514 

[22] Filed: Nov. 29, 1995 

[51] Int. Cl.6 G06F 9/45

[52] U.S. Cl. 395/709; 395/588

[58] Field of Search 395/705, 709, 588, 580

[56] References Cited 

U.S. PATENT DOCUMENTS 

5,265,253 11/1993 Yamada 395/700

5,367,651 11/1994 Smith et al. 395/700

5,386,562 1/1995 Jain et al. 395/650

OTHER PUBLICATIONS

"A Comparative Evaluation of Software Techniques to Hide
Memory Latency", John et al., Proc. of the 28th Ann. Hawaii
Int'l Conf., 1995, pp. 229-238.

"Schedule driven Loop Unrolling for Parallel Processors",
System Sciences, 1991 Annual Hawaii Int'l Conference,
1991, vol. II, pp. 458-467.

"Aggressive Loop Unrolling in a Retargetable, Optimizing
Compiler", Davidson et al., Dept. of Comp. Science, Univ. of
Va., pp. 1-14.

"Unrolling Loops in Fortran," Dongarra et al., Soft. Practice
and Experience, vol. 9, 1979, pp. 219-226.

Hendren et al., "Designing Programming Languages for the
Analyzability of Pointer Data Structures," Comput. Lang.,
vol. 19, No. 2, pp. 119-134 (1993).

Weiss et al., "A Study of Scalar Compilation Techniques for
Pipelined Supercomputers," ACM, pp. 105-109 (1987).

Primary Examiner: Emanuel Todd Voeltz
Assistant Examiner: Kakali Chaki



[57] ABSTRACT

A compiler facilitates efficient unrolling of loops and 
enables the elimination of extra branches from the loops, 
including the elimination of conditional branches from 
unrolled loops with early exits. Unrolling also enhances 
other optimizations, such as prefetch, scalar replacement, 
and instruction scheduling. The unroll factor is calculated to 
determine the amount of loop expansion and the optimum 
location to place compensation code to complete the original 
loop count, i.e. before or after the unrolled loop. The 
compiler is applicable, for example, to modern RISC 
architectures, where the latency of memory references and 
branches is higher than that of integer and floating point 
arithmetic instructions. 

16 Claims, 13 Drawing Sheets

I = N
BEGIN_LOOP:
    A[I] = B[I] + X                          (150)
    I = I - 1
    IF (I > 0) GOTO BEGIN_LOOP
END_LOOP:
    PRINT I

Fig. 3 (PRIOR ART)



I = N
BEGIN_LOOP:
    A[I] = B[I] + K                          (401)
    I = I - 1
    IF (I < 0) GOTO END_LOOP                 (403)
    C[I] = A[I] + X                          (405)
    GOTO BEGIN_LOOP
END_LOOP:
    PRINT I

Fig. 7a (PRIOR ART)



BEGIN_LOOP:
    A[I] = A[I-2] / B[I]                     (367)
    I = I + 1
    IF (I < N) GOTO BEGIN_LOOP
END_LOOP:

Fig. 9 (PRIOR ART)






U.S. Patent Aug. 18, 1998 Sheet 5 of 13 5,797,013 



I = N
IF (I < 4) GOTO BEGIN_COMPENSATION_LOOP      (301)
UNROLLED_LOOP:
    A[I] = B[I] + X                          (333)
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    IF (I >= 4) GOTO UNROLLED_LOOP           (159)
END_UNROLLED_LOOP:
    IF (I = 0) GOTO END_COMPENSATION_LOOP    (303)
BEGIN_COMPENSATION_LOOP:
    A[I] = B[I] + X
    I = I - 1
    IF (I > 0) GOTO BEGIN_COMPENSATION_LOOP
END_COMPENSATION_LOOP:
    PRINT I

Fig. 4 (PRIOR ART)

I = N
IF (I < 5) GOTO BEGIN_COMPENSATION_LOOP
UNROLLED_LOOP:
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    IF (I >= 5) GOTO UNROLLED_LOOP
END_UNROLLED_LOOP:
BEGIN_COMPENSATION_LOOP:
    A[I] = B[I] + X
    I = I - 1
    IF (I > 0) GOTO BEGIN_COMPENSATION_LOOP
END_COMPENSATION_LOOP:
    PRINT I

(Reference numerals shown in the drawing: 311, 321, 323)

Fig. 5






I = N
J = ((N-1) % 4) + 1      NOTE: ((N-1) MOD 4) + 1
I = N - J                NOTE THAT I IS GUARANTEED
                         TO BE DIVISIBLE BY 4 HERE.
BEGIN_COMPENSATION_LOOP:
    A[J] = B[J] + X
    J = J - 1
    IF (J > 0) GOTO BEGIN_COMPENSATION_LOOP
END_COMPENSATION_LOOP:
    IF (I = 0) GOTO END_UNROLLED_LOOP
UNROLLED_LOOP:
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    A[I] = B[I] + X
    I = I - 1
    IF (I > 0) GOTO UNROLLED_LOOP
END_UNROLLED_LOOP:
    PRINT I

Fig. 6

I = N
BEGIN_LOOP:
UNROLLED_LOOP:
    A[I] = B[I] + K                          (377)
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP
    C[I] = A[I] + X
END_UNROLLED_LOOP:
BEGIN_COMPENSATION_LOOP:
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP    (365)
    C[I] = A[I] + X
    GOTO BEGIN_COMPENSATION_LOOP
END_LOOP:
    PRINT I

Fig. 7b
(PRIOR ART)

I = N
IF (I < 5) GOTO BEGIN_COMPENSATION_LOOP
UNROLLED_LOOP:
    A[I] = B[I] + K                          (319)
    I = I - 1
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    C[I] = A[I] + X
    A[I] = B[I] + K
    I = I - 1
    C[I] = A[I] + X
    IF (I >= 5) GOTO UNROLLED_LOOP
END_UNROLLED_LOOP:
BEGIN_COMPENSATION_LOOP:
    A[I] = B[I] + K
    I = I - 1
    IF (I < 0) GOTO END_COMPENSATION_LOOP    (365)
    C[I] = A[I] + X
    GOTO BEGIN_COMPENSATION_LOOP
END_LOOP:

    PRINT I

(Other reference numerals shown in the drawing: 312, 361, 363)

Fig. 8

T = A[I-2]
T1 = A[I-1]
BEGIN_LOOP:
    A[I] = T / B[I]
    T = T1
    T1 = A[I]
    I = I + 1
    IF (I < N) GOTO BEGIN_LOOP
END_LOOP:

(Reference numerals shown in the drawing: 391, 393, 395)

Fig. 10 (PRIOR ART)



T = A[I-2]
T1 = A[I-1]
BEGIN_LOOP:
    A[I] = T / B[I]
    T = T1
    T1 = A[I]
    I = I + 1
    A[I] = T / B[I]
    T = T1
    T1 = A[I]
    I = I + 1
    A[I] = T / B[I]
    T = T1
    T1 = A[I]
    I = I + 1
    IF (I < N) GOTO BEGIN_LOOP
END_LOOP:

(Reference numerals shown in the drawing: 388, 391, 395)

Fig. 11 (PRIOR ART)

T = A[I-2]
T1 = A[I-1]
BEGIN_LOOP:
    A[I] = T / B[I]
    T2 = A[I]
    A[I+1] = T1 / B[I+1]
    T = A[I+1]
    A[I+2] = T2 / B[I+2]
    T1 = A[I+2]
    I = I + 3
    IF (I < N) GOTO BEGIN_LOOP
END_LOOP:

(Reference numerals shown in the drawing: 400, 401, 402, 403, 410)


Fig. 12

[Flowchart]

DETERMINE TRIPCOUNT IF KNOWN
DETERMINE UNROLL FACTOR FROM LOOP LENGTH, PREFETCH SPAN,
    PROFILE, RESOURCE USAGE
IF TRIPCOUNT KNOWN AND TRIPCOUNT MOD UNROLLFACTOR = 0
    ULCODE = UNROLL(LOOP, UNROLLFACTOR, NOCOMP)
    REMOVEMIDDLEEXITS(ULCODE) EXCEPT THE FINAL EXIT
    OUTPUT(ULCODE)
NO:
    ULCODE = UNROLL(LOOP, UNROLLFACTOR, COMPOK)
    REMOVEMIDDLEEXITS(ULCODE)
    ULCODE = MOVE FINAL EXIT TO END OF LOOP (ULCODE)
    IF PLACECOMP(UNROLLFACTOR) = BEFORE
        OUTPUT COMPCODE FOR LOOP
        OUTPUT ULCODE
    OTHERWISE
        OUTPUT ULCODE
        OUTPUT COMPCODE FOR LOOP
DONE

(Reference numerals shown in the drawing: 203, 205, 242, 244,
226, 229, 246, 999)


Fig. 13

INTELLIGENT LOOP UNROLLING

BACKGROUND OF THE INVENTION 

1. Technical Field 

The invention relates to compilers. More particularly, the 
invention relates to techniques for unrolling computation 
loops in a compiler so as to generate code which executes 
faster. 

2. Description of The Prior Art 

FIG. 1 is a block schematic diagram of a uniprocessor 
computer architecture, including a processor cache. In the 
figure, a processor 11 includes a cache 12 which is in 
communication with a system bus 15. A system memory 13 
and one or more I/O devices 14 are also in communication 
with the system bus. 

FIG. 2a is a block schematic diagram of a software 
compiler 20, for example as may be used in connection with 
the computer architecture shown in FIG. 1. The compiler 
Front End component 21 reads a source code file (100) and 
translates it into a high level intermediate representation 
(110). A high level optimizer 22 optimizes the high level 
intermediate representation 110 into a more efficient form. A 
code generator 23 translates the optimized high level inter- 
mediate representation to a low level intermediate represen- 
tation (120). The low-level optimizer 24 converts the low 
level intermediate representation (120) into a more efficient 
(machine-executable) form. Finally, an object file generator 
25 writes out the optimized low-level intermediate repre- 
sentation into an object file (141). The object file (141) is 
processed along with other object files (140) by a linker 26 
to produce an executable file (150), which can be run on the 
computer 10. In the invention described herein, it is assumed 
that the executable file (150) can be instrumented by the 
compiler (20) and linker (26) so that when it is run on the
computer 10, an execution profile (160) may be generated, 
which can then be used by the low level optimizer 24 to 
better optimize the low-level intermediate representation 
(120). The compiler 20 is discussed in greater detail below. 

The compiler is the piece of software that translates 
source code, such as C, BASIC, or FORTRAN, into a binary 
image that actually runs on a machine. Typically the com- 
piler consists of multiple distinct phases, as discussed above 
in connection with FIG. 2a. One phase is referred to as the 
front end, and is responsible for checking the syntactic 
correctness of the source code. If the compiler is a C 
compiler, it is necessary to make sure that the code is legal 
C code. There is also a code generation phase, and the 
interface between the front-end and the code generator is a 
high level intermediate representation. The high level inter-
mediate representation is a more refined series of instruc-
tions that need to be carried out. For instance, a loop might
be coded at the source level as:

which might in fact be broken down into a series of steps, 
e.g. each time through the loop, first load up I and check it 
against 10 to decide whether to execute the next iteration. 
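As an illustration only, the relationship between a source-level loop and its step-by-step form can be sketched in Python. The loop below is a hypothetical stand-in rather than the patent's own listing, and the names `sum_source` and `sum_lowered` are assumptions introduced for this sketch.

```python
def sum_source():
    """Source-level form: a counted loop over I = 1..10."""
    total = 0
    for i in range(1, 11):
        total += i
    return total


def sum_lowered():
    """The same loop written as explicit steps, closer to an
    intermediate representation."""
    total = 0
    i = 1                      # initialize the index variable
    while True:
        if i > 10:             # load I and check it against 10
            break              # exit branch
        total = total + i      # loop body
        i = i + 1              # increment, then go around again
    return total
```

Both functions compute the same value; the second simply exposes the test and branch that the first leaves implicit.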

A code generator takes this high level intermediate
representation and transforms it into a low level intermediate
representation. This is closer to the actual instructions that
the computer understands. An optimizer component of a
compiler must preserve the program semantics (i.e. the
meaning of the instructions that are translated from source
code to a high level intermediate representation, and thence
to a low level intermediate representation and ultimately an
executable file), but rewrites or transforms the code in a way
that allows the computer to execute an equivalent set of
instructions in less time.

Modern compilers are structured with a high level optimizer
(HLO) that typically operates on a high level intermediate
representation and substitutes in its place a more efficient
high level intermediate representation of a particular program
that is typically shorter. For example, an HLO might eliminate
redundant computations. With the low level optimizer (LLO),
the objectives are the same as the HLO, except that the LLO
operates on a representation of the program that is much
closer to what the machine actually understands.

FIG. 2b is a block diagram showing a low level optimizer
for a compiler, including a loop unrolling component 30
according to the invention. The low level optimizer 24 may
include any combination of known optimization techniques,
such as those that provide for local optimization 35, global
optimization 36, loop identification 37, loop invariant code
motion 38, prefetch 34, register reassociation 31, and
instruction scheduling 32.

Source programs translated into machine code by compilers
contain loops, e.g. DO loops, FOR loops, and WHILE loops.
Optimizing the compilation of such loops can have a major
effect on the run time performance of the program generated
by the compiler. In some cases, a significant amount of time
is spent doing such bookkeeping functions as loop iteration
and branching, as opposed to the computations that are
performed within the loop itself. These loops often implement
scientific applications that manipulate large arrays and data
instructions, and run on high speed processors.

This is particularly true on modern processors, such as
RISC architecture machines. The design of these processors
is such that in general the arithmetic operations operate much
faster than memory fetch operations. This mismatch
between processor and memory speed is a very significant
factor in limiting the performance of microprocessors. Also,
branch instructions, both conditional and unconditional,
have an increasing effect on the performance of programs.
This is because most modern architectures are super-pipelined
and have some sort of a branch prediction algorithm
implemented. The aggressive pipelining makes the branch
misprediction penalty very high. Arithmetic instructions are
interregister instructions that can execute quickly, while the
branch instructions, because of mispredictions, and memory
instructions such as loads and stores, because of slower
memory speeds, can take a longer time to execute.
Modern compilers perform code optimization. Code
optimization consists of several operations that improve the
speed and size of the compiled code, while maintaining
semantic equivalence. Common optimizations include:

prefetching data so that they are available in cache
memory when needed;

detecting calculations as computing constants and
performing the calculation at compile time;

scalar replacement, which keeps the value of a variable in
a register within the loop;

moving calculations outside of loops where possible; and

performing code scheduling, which consists of rearranging
the order of and modifying instructions to achieve
faster running but semantically equivalent code.

Many modern compilers also employ an optimizing technique
known as loop unrolling to generate faster running
code. In its essence, loop unrolling takes the inner loop, i.e.
the code between the beginning and the end of the loop, and
repeats it in the inner loop some number of times, e.g. four
times. It then executes the unrolled loop one-fourth as many
times as it would have executed the original loop. The
number of times the loop is replicated within the unrolled
loop is called the unroll factor. Because the number of times
the original loop is executed is not always divisible by the
unroll factor, compensation loop code often has to be
generated to execute the remaining iterations of the
original loop that are not executed by the unrolled loop.
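The idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the compiler's actual output: the function names `add_rolled` and `add_unrolled` are hypothetical, arrays are indexed 1..n with slot 0 unused, and the original loop is written as a top-tested loop so that n = 0 is also handled.

```python
def add_rolled(b, x, n):
    """Original loop: one element and one exit test per iteration."""
    a = [0] * (n + 1)          # slot 0 unused, mirrors 1-based indexing
    i = n
    while i > 0:
        a[i] = b[i] + x
        i -= 1
    return a


def add_unrolled(b, x, n):
    """Same computation with an unroll factor of four."""
    a = [0] * (n + 1)
    i = n
    # Unrolled loop: four copies of the body per exit test.
    while i >= 4:
        a[i] = b[i] + x
        a[i - 1] = b[i - 1] + x
        a[i - 2] = b[i - 2] + x
        a[i - 3] = b[i - 3] + x
        i -= 4
    # Compensation loop: finishes the 0..3 iterations left over
    # when n is not divisible by the unroll factor.
    while i > 0:
        a[i] = b[i] + x
        i -= 1
    return a
```

For any n, both functions fill the array identically; only the number of loop tests executed differs.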

As discussed above, such loops as DO, FOR and WHILE
loops are common in programs, especially in scientific and
other time-consuming programs. Frequently 80% of the
running time of a program can be in a few small loops. As
a result, anything that can speed up such loops is of great
value in making a more efficient compiler.

Consider the simple loop shown on FIG. 3. The three
instructions in the inner loop 150 are executed N times.
According to the prior art this loop can be unrolled with an
unroll factor of four to produce the code shown on FIG. 4,
where the inner loop (less the exit condition) is replicated 4
times 333. This loop is also followed by a test 303 to see if
the full original loop has been completed, and a compensation
loop 155 which is executed to complete the original
loop trip count if it has not been completed.

An inspection of the loop shows that it is semantically
equivalent to the loop of FIG. 3 because the same function
is performed and the same result is achieved. However, the
loop of FIG. 4 runs much faster for several reasons. First, the
conditional branch 159 which exits the loop is executed only
once for each time through the unrolled loop rather than
once for each time through the original loop. Assuming an
unroll factor of four as is shown here, this saves 3/4 of a
conditional branch per original loop iteration. More
importantly, other compiler optimizations interact with loop
unrolling and are able to do a much better job of optimizing
the unrolled loop, such as that identified by numeric
designator 333, as compared to the original loop, identified by
numeric designator 150. In an unrolled loop, there are more
operations that could be scheduled in parallel, more
opportunity to do scalar replacement and other optimizations, and
more possibilities to do prefetching.
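The 3/4 figure can be checked with a short calculation. This is a sketch that counts only executed conditional branches, with U denoting the unroll factor; the total for the unrolled form assumes N >= U and the FIG. 4 layout, whose two guard tests 301 and 303 each execute once.

```latex
\[
  \underbrace{1}_{\text{original}} \;-\; \underbrace{\tfrac{1}{U}}_{\text{unrolled}}
  \;=\; 1 - \tfrac{1}{4} \;=\; \tfrac{3}{4}
  \quad \text{loop-closing branches saved per original iteration } (U = 4)
\]
\[
  B_{\text{orig}}(N) = N, \qquad
  B_{\text{unrolled}}(N) = \left\lfloor \tfrac{N}{U} \right\rfloor + (N \bmod U) + 2
\]
```

For example, with N = 102 and U = 4 the original loop executes 102 conditional branches, while the unrolled form executes 25 + 2 + 2 = 29.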

The tests identified by numeric designators 301 and 303 
are also of interest. These are the conditional branches which 
have a higher probability of being mispredicted. Anything 
that can be done to eliminate one of them will be very useful.

Prior art loop unrolling techniques have certain
disadvantages. For example, J. J. Dongarra and A. R. Hinds, Unrolling
Loops in FORTRAN, describes how one can unroll loops
manually by duplicating code. This is an early solution to
optimizing code that is not even implemented by the
compiler.

S. Weiss and J. E. Smith, A Study of Scalar Compilation
Techniques for Pipelined Supercomputers, discuss unrolling in
the compiler. Here the authors address the simple situations:

a) Cases where the loop count is known at compile time. 
They do not address loop unrolling when the loop count 
is only known at run time. 

b) Cases where the loop exit appears only at the beginning 
or end of the loop. They do not address the situation of 
unrolling loops with early exits (loops whose exit may 
occur in the middle of the loop). 

The authors do not address the following issues: 

a) Determining whether to place the compensation code 
before or after the unrolled loop. 

b) Tuning of the iteration count to reduce branch mispre- 
diction. 

c) Factors that affect the unroll factor.

L. J. Hendren and G. R. Gao, Designing Programming
Languages for the Analyzability of Pointer Data Structures,
address the issue of unrolling loops as part of compiler
optimization. They do not however discuss:

a) Unrolling loops with early exits;

b) Tuning of the iteration count to reduce branch
misprediction;

c) Factors that affect the unroll factor;

d) Whether to place a compensation code at the beginning
or end of the loops; and

e) Compiling loops whose trip count is only known at run
time.

J. Davidson, S. Jinturkar, Aggressive Loop Unrolling in a
Retargetable, Optimizing Compiler, Dept. of Computer
Science, Thornton Hall, University of Virginia disclose a
code transformation, referred to as aggressive loop
unrolling, in a retargetable optimizing compiler where the
loop bounds are not known at compile time. Various factors
were analyzed to determine how and when loop unrolling
should be applied, resulting in an algorithm for loop
unrolling in which execution-time counting loops (i.e. a counting
loop whose iteration count is not trivially known at compile
time) are unrolled and loops having complex control-flow
are unrolled. However, they do not discuss:

a) Unrolling loops having early exits;

b) Tuning of the iteration count to reduce branch
misprediction;

c) Factors that affect the unroll factor; and

d) Whether to place a compensation code at the beginning
or end of the loops.

Another part of the loop unrolling prior art is shown on
FIGS. 7a and 7b, which illustrate a loop having an early exit
(also referred to as a WHILE loop), consisting of an exit test
and branch 403 in the middle of the loop between computations
401 and 405. According to the prior art, the code, including
the exit 403, is replicated four times in an unrolled loop. The
number of branches in the loop is, however, not reduced
with this optimization.

While some of the techniques discussed in the prior art are
applicable to compilers for all computers, only some of them
are particularly applicable for modern RISC computers,
where branch instructions form a much bigger bottleneck than
in earlier technologies. Also, compilers for RISC architectures
are much more aggressive, and the interactions of
various optimizations play a key role in the quality of the
final code.

SUMMARY OF THE INVENTION 


The invention provides a new compiler that can unroll
more loops than previous algorithms. It also significantly
reduces the number of branch instructions by cleverly
handling the iteration count and by converting loops with early
exits to regular FOR loops. The invention also provides for
computing the unroll factor and the placement of the
compensation loop by taking many other optimizations into
consideration.

The compiler:

Eliminates time consuming conditional branch instructions
from the compensation code loop by changing the
conditional exit of the main unrolled loop so that it always
exits with at least one iteration which has yet to be
executed by the compensation code. This eliminates the
need to test for zero remaining iterations.

Determines whether it is better to place the compensation
code at the beginning or the end of the unrolled loop
according to which one would likely provide the better
optimization. Generally, it prefers to put the compensation
loop in front of the main loop if the unroll factor
is a power of two and after the main loop if the unroll
factor is not a power of two.

Computes the unroll factor by taking into account the 
interactions of other optimizations like prefetch, scalar 
replacement and register allocation, and also taking 
into account hardware features like number of func- 
tional units. Unrolling loops over-aggressively or 
under-aggressively can inhibit other optimizations or 
make them less effective. 

Converts loops with early exit to loops with exit at the end 
to apply more efficient optimizations to the loop. It does 
this by ensuring that the compensation code is always 
executed at least once, enabling the compiler to elimi- 
nate the exit tests from the unrolled loop. 
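The first and last items rest on the same idea: guard the unrolled loop with (unroll factor + 1) so that at least one iteration is always left for the compensation code. A minimal Python sketch of both cases follows, loosely modelled on the loops of FIGS. 5, 7a and 8. The function names are hypothetical, an unroll factor of four is assumed, and the counted loop assumes n >= 1, as a bottom-tested loop would.

```python
def counted_unrolled(b, x, n):
    """Counted loop A[I] = B[I] + X for I = n..1 (n >= 1 assumed)."""
    a = [0] * (n + 1)
    i = n
    while i >= 5:                  # guard is unroll factor + 1
        a[i] = b[i] + x
        a[i - 1] = b[i - 1] + x
        a[i - 2] = b[i - 2] + x
        a[i - 3] = b[i - 3] + x
        i -= 4
    # Here 1 <= i <= 4, so no "zero iterations left" test is needed.
    while True:
        a[i] = b[i] + x
        i -= 1
        if i <= 0:
            break
    return a


def early_exit_reference(b, k, x, n):
    """Early-exit loop: the exit test sits in the middle of the body."""
    a, c = [0] * (n + 1), [0] * (n + 1)
    i = n
    while True:
        a[i] = b[i] + k
        i -= 1
        if i < 0:                  # early exit
            break
        c[i] = a[i] + x
    return a, c, i


def early_exit_unrolled(b, k, x, n):
    """Same loop, unrolled four times with the middle exits removed."""
    a, c = [0] * (n + 1), [0] * (n + 1)
    i = n
    while i >= 5:                  # i stays >= 1 inside: exits never fire
        a[i] = b[i] + k; i -= 1; c[i] = a[i] + x
        a[i] = b[i] + k; i -= 1; c[i] = a[i] + x
        a[i] = b[i] + k; i -= 1; c[i] = a[i] + x
        a[i] = b[i] + k; i -= 1; c[i] = a[i] + x
    while True:                    # compensation keeps the only exit test
        a[i] = b[i] + k
        i -= 1
        if i < 0:
            break
        c[i] = a[i] + x
    return a, c, i
```

In both unrolled forms the only data-dependent exit test that survives is the one in the compensation loop.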

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 shows a block schematic diagram of a uniprocessor
computer architecture including a processor cache;

FIG. 2a shows a block schematic diagram of a modern 
software compiler; 

FIG. 2b shows a schematic diagram of the low level
optimizer;

FIG. 3 shows a simple program loop; 

FIG. 4 shows the loop of FIG. 3 that has been unrolled 
four times according to the prior art; 

FIG. 5 shows the loop of FIG. 3 unrolled according to the 
invention and having an eliminated branch; 

FIG. 6 shows an unrolled loop having pre-loop compen- 
sation code; 

FIG. 7a shows a simple WHILE loop; 

FIG. 7b shows the WHILE loop of FIG. 7a that has been 
unrolled according to the prior art; 

FIG. 8 shows the WHILE loop of FIG. 7a that has been 
unrolled according to the invention; 

FIG. 9 shows a schematic representation of a simple loop; 

FIG. 10 shows the loop of FIG. 9 with scalar replacement 
according to the prior art; 

FIG. 11 shows the loop of FIG. 10 unrolled; 

FIG. 12 shows a schematic representation of the loop of 
FIG. 11 after copy elimination; 

FIG. 13 shows a schematic representation of the compiler 
logic which determines the unroll factor, compensation code 
placement and other optimizations; and 

FIG. 14 is a block schematic diagram of a compiler for a 
programmable machine in accordance with the invention. 

DETAILED DESCRIPTION OF THE 
INVENTION 

The invention provides a new compiler that features smart 
unrolling of loops. 
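One way to picture the decision logic summarized above and drawn in FIG. 13 is the following Python sketch. It is an illustration under assumptions, not the patented implementation: the function names are hypothetical, the unroll factor is taken as an input rather than computed, and a bit mask stands in for the cheap remainder calculation that is available when the unroll factor is a power of two.

```python
def is_power_of_two(u):
    """True for 1, 2, 4, 8, ..."""
    return u > 0 and (u & (u - 1)) == 0


def place_compensation(unroll_factor):
    """Prefer compensation code before the unrolled loop when the
    unroll factor is a power of two, and after it otherwise."""
    return "before" if is_power_of_two(unroll_factor) else "after"


def plan_unrolling(unroll_factor, trip_count=None):
    """Return the order in which the pieces of code are emitted."""
    if trip_count is not None and trip_count % unroll_factor == 0:
        return ["unrolled_loop"]            # no compensation needed
    if place_compensation(unroll_factor) == "before":
        return ["compensation_loop", "unrolled_loop"]
    return ["unrolled_loop", "compensation_loop"]


def leftover_iterations(trip_count, unroll_factor):
    """Iterations the unrolled loop cannot cover."""
    if is_power_of_two(unroll_factor):
        return trip_count & (unroll_factor - 1)   # no divide needed
    return trip_count % unroll_factor
```

For instance, `plan_unrolling(4, 102)` places the compensation loop first, and `leftover_iterations(102, 4)` gives the two iterations it would have to cover.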

The invention provides a prefetch driver 34 that operates 
in concert with such known techniques. The following 
discussion pertains to the various elements of the low level 
optimizer shown on FIG. 2b:

Local optimizations include code improving transformations
that are applied on a basic block by basic block basis.
For purposes of the discussion herein, a basic block
corresponds to the longest contiguous sequence of machine
instructions without any incoming or outgoing control
transfers, excluding function calls. Examples of local
optimizations include local common sub-expression elimination
(CSE), local redundant load elimination, and peephole
optimization.

Global optimizations include code improving transformations
that are applied based on analysis that spans across
basic block boundaries. Examples include global common
sub-expression elimination, loop invariant code motion,
dead code elimination, register allocation and instruction
scheduling.

Loop invariant code motion is the identification of
instructions located within a loop that compute the same result
on every loop iteration and the re-positioning of such
instructions outside the loop body.
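A small Python sketch of the transformation, written by hand purely for illustration (the function names and the example computation are assumptions, not taken from the patent):

```python
def scale_naive(values, base, exponent):
    """The factor is recomputed on every iteration although it
    never changes inside the loop."""
    out = []
    for v in values:
        factor = base ** exponent      # loop-invariant computation
        out.append(v * factor)
    return out


def scale_hoisted(values, base, exponent):
    """After loop invariant code motion: the factor is computed
    once, outside the loop body."""
    factor = base ** exponent
    out = []
    for v in values:
        out.append(v * factor)
    return out
```

Both versions return the same list; the second performs the exponentiation once instead of once per element.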

Register allocation and instruction scheduling is the
process of assigning hardware registers to symbolic instruction
operands and the re-ordering of instructions to minimize
run-time pipeline stalls where the processor must wait on a
memory fetch from main memory or wait for the completion
of certain complicated instructions that take multiple cycles
to execute (e.g. divide, square root instructions).

One important phase of the compiler identifies loops and
access patterns to estimate how many cycles are devoted to 
loop iterations. In the invention, the compiler translates the 
higher level application code into an instruction stream that 
the processor executes, and in the process of this translation 
the compiler unrolls loops. 

The longer unrolled loops allow the compiler to provide
several advantages, such as:

1) It eliminates the extra branch exits. This saves CPU
cycles by not having to execute the branch instructions
and also helps reduce branch misprediction. Why this is
important is evident if one considers most modern
RISC architectures. These architectures have a long
pipeline that is fed by an instruction fetch mechanism.
When the fetch mechanism encounters a branch, it tries
to predict if the branch is going to be taken or not. It
then fetches instructions based on this prediction. The
prediction is necessary to keep the pipeline from
stalling. If the architecture's prediction is correct (this is
determined when the branch instruction completes
execution, which is a few cycles after it has been
fetched), then everything works fine; else all the
instructions that have been fetched after the branch are
discarded and the new instructions fetched based on the
correct outcome of the branch. This penalty of
discarding fetched instructions and fetching new ones, when
the branch is mispredicted, is known as the branch
misprediction penalty and it is very significant for most
modern architectures. It is of the order of 5-10 cycles
per branch instruction that is mispredicted. By reducing
the branch instructions, the number of branches that get
mispredicted automatically reduces.
2) It can better insert prefetches and effect other
optimizations into the longer inner loop code. When the loop
is unrolled, there are more memory instructions in the
loop and also the memory stride (the distance between
the memory accesses of an instruction in two consecutive
iterations) is bigger. If a loop is unrolled four times,
the memory stride goes up by a factor of four. This helps
the prefetch to do a more effective job. When the memory
stride increases, as long as it is less than the cache line
size (which is architecture dependent), the prefetches
become more effective. When the memory stride
becomes greater than the cache line size the prefetches
can hurt. Hence the loops should be unrolled such that
the memory stride is less than the cache line size
whenever prefetch instructions are going to be generated
for the loop and whenever possible.

3) Scalar replacement/recurrence elimination inserts copies
at the loops to keep the value of a variable live in
the next iteration (see, for example, FIGS. 9, 10, and
11). These copies can be eliminated by unrolling the
loop a certain number of times.

4) Longer sequences of non-branching instructions can
achieve an overlap between instructions that have nothing
to do with memory and those that do. This is known
as instruction scheduling and explained below. While
the access time between the processor and the cache is
typically 1 to 5 cycles, the retrieval time from cache to
memory is often on the order of 10 to 100 cycles. When
the processor actually gets to the point where the data
item is needed from memory, if the data is not in cache,
it might take 100 processor cycles to fetch it from main
memory. Where the compiler can optimize the longer
loop code, it may only be necessary to wait for 20
cycles because 80 cycles worth of look up time is
hidden or overlapped with the execution of other
instructions.

Loop unrolling is integrated with other low level optimization
phases, such as the prefetch insertion algorithm,
register reassociation, and instruction scheduling. The new
compiler yields significant performance improvements for
some industry-standard performance benchmarks, for
example on the SPEC92 and SPEC95 benchmarks on the
Hewlett-Packard Company (Palo Alto, Calif.) PA-8000
processor.

The following discussion explains compiler operation in
the context of a loop within an application program. Loops
are readily recognized as a sequence of code that is
iteratively executed some number of times. The sequence of such
operations is predictable because the same set of operations
is repeated for each iteration of the loop. It is common
practice in an application program to maintain an index
variable for each loop that is provided with an initial value,
and that is incremented by a constant amount for each loop
iteration until the index variable reaches a final value. The
index variable is often used to address elements of arrays
that correspond to a regular sequence of memory locations.

In the compiler, it has been found that the low level
optimizer component of a compiler is in a good position to
deduce the number of cycles required by a stretch of code
that is repetitively executed and this information can be used
to determine the optimal unroll factor. As discussed above,
the concept of loop unrolling is not new, but use of smart
unrolling is new. For example, FIG. 4 shows the loop of FIG.
3 after the loop has been unrolled four times. Thus, instead
of executing the loop 100 times if N were 4, the loop is

loop 383 is in front of the repetitive unrolled loops 347. In
the general case, putting the compensation code before the
main unrolled loop is less efficient than putting it afterwards,
because calculating the loop trip count requires a remainder
operation which involves high latency divide operations.
However, if the unroll factor is a power of two, as in this
case where the unroll factor is 4, the remainder calculation
is a simple shift operation. Because unroll factors of 2, 4, or
8 are common, the compensation code can be placed
in front of the unroll loop for negligible cost. As a practical
matter, it is often advantageous to put the compensation loop
in front of the unrolled loop to benefit from other
optimizations such as register reassociation. When the
compensation loop is placed before the unrolled loop, the variable that
keeps track of the count is not always needed after
the unrolled loop. When the compensation loop is placed
after the unrolled loop, this variable is needed after
the unrolled loop as there is an exposed use in the
compensation loop. This exposed use can inhibit aggressive register
reassociation. In the preferred embodiment, the architecture
of the computer and the interactions with other optimizations
dictate an unroll factor, and if it is a power of two, the
compensation code is inserted in front of the unroll loop.

Another optimization technique that is part of the invention
herein disclosed rearranges loops with early exits
(which are henceforth referred to as WHILE loops). These
loops are characterized by the fact that some of the inner
loop code is done before the loop test, and some after the
loop test as is shown on FIG. 7a. Here the loop has an exit
branch 403 in the middle with inner loop operations 401
before it, and other inner loop operations 405 after it.

The optimization taught in the prior art for this loop is
shown on FIG. 7b. Notice that the whole inner loop,
including the exit instruction, is replicated 377 four times.
This can be improved by converting the unrolled loop into
a FOR loop with an exit condition of (unroll factor + 1) as
opposed to unroll factor (e.g. 5 instead of 4 in this case), as
is shown on FIG. 8. This guarantees that the unrolled loop
is exited before it would have to exit due to the WHILE
condition. Because none of the branches at 377 on FIG. 7b
are executed, the WHILE exit instruction can be removed, as
is shown at 319 on FIG. 8. Thus, there is only one place that
there is a WHILE loop exit, i.e. at 365.

The technique herein disclosed ensures that the unrolled
loop exits before it would take any of the WHILE loop exits,
so that the WHILE test can be removed from the unrolled
loop. It is necessary to ensure that the compensation code is
always executed at least once.

The following discusses how the unroll factor interacts
with the scalar replacement optimization. This is particularly

executed 25 times. important because the form of this optimization determines 

FIG. 5 shows the output code that is generated by the the unroll factor. Consider the loop shown on FIG. 9. Notice 

invention in contrast to the code generated by the prior art. that the value of A[I] stored in the inner loop at 367 is loaded 

as shown on FIG. 4. The replicated inner loops 333 are the again two loop iterations later, when the same statement 

same. Also, the compensation loop 323 is the same as the 55 loads A[I-2] with a value of I which is incremented by 2. The 

prior art compensation loop 155 of FIG. 4. However, the idea behind scalar optimization which is well known in the 

loop test at 301 and 159 of FIG. 4 now tests to exit if I>=5 prior art. is to save array values in temporary variables if 

rather than I>=4. as can be seen at 311 and 321 of FIG. 5. they are accessed shortly within the next few iterations. 

The effect is to ensure that the compensation loop is always Thus, the loop can be modified as is shown on FIG. 10. 

executed at least once. This eliminates the need to test for the 60 Here, the array reference A| 1-2 1 at 391 is replaced with T, 

zero case (303 in FIG. 4). This eliminates the branch and it is followed by two instructions at 393 and 395 which 

instruction 303 on FIG. 4. As indicated above, the elimina- assign values to T and Tl. T\vo scalar temporary variables 
tion of this branch instruction significantly increases the are necessary because the value of the two most recent array 
speed of the compiled code by reducing the number of values must be saved. The value of A|I-1| would have been 

branch instructions that get mispredicted. 65 stored in the previous iteration in Tl and that is going to be 

It is also possible to put the compensation code in front of used in the next iteration. We move Tl to T and T will be 
the main loop, as is shown on FIG. 6, Here the compensation accessed in the next iteration. Similarly T 1 , to which A| 1 1 is 
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assigned will be moved to T in the next iteration and used 
2 iterations from now. * * 
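
The step from the form of FIG. 9 to the form of FIG. 10 can be sketched in Python as follows. The figures are not reproduced in this text, so the loop body `A[i] = A[i-2] + B[i]`, the second array `B` and both function names are assumptions, chosen only to give a recurrence at a distance of two iterations.

```python
def recurrence_plain(a, b):
    """Reference loop: each iteration reloads A[i-2] from the array."""
    a = list(a)
    for i in range(2, len(a)):
        a[i] = a[i - 2] + b[i]
    return a


def recurrence_scalar_replaced(a, b):
    """Same loop with the load of A[i-2] replaced by the scalar t."""
    a = list(a)
    if len(a) < 2:
        return a
    # Initialization code: t holds A[i-2] and t1 holds A[i-1] for i = 2.
    t, t1 = a[0], a[1]
    for i in range(2, len(a)):
        value = t + b[i]   # t stands in for the array load of A[i-2]
        a[i] = value       # the store to A[i] is still needed
        t = t1             # the old A[i-1] is A[i-2] in the next iteration
        t1 = value         # the value just stored is A[i-1] next time
    return a
```

The two assignments at the end of the loop body correspond to the moves between T and T1 described above. In this sketch `value` plays the part of the register that already holds A[I].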

The instruction at 395 appears to make an indexed
reference to the array A and that suggests that an array access
must be made to get the number to put into T1, which would
be a high latency operation and would lose all that was
gained by the optimization. What actually happens is that the
optimizer recognizes that the value of A[I] is stored two
instructions earlier at 391, and that A[I] is resident in a
register which can be stored into T1 without accessing A[I].
As T1 is likely to be assigned to a register, this operation is
a register to register instruction.

The foregoing illustrates how the various optimization
techniques are interrelated, allowing the loop unrolling
optimizer to generate code which is clearly not optimal in itself
but is optimized by other optimizers in the compiler.
Initialization code is inserted at 388 to define initial values of
T and T1.

FIG. 11 shows how such a loop can be unrolled according
to the prior art if an unroll factor of three had been chosen.
The prior art does not specify the selection of an unroll
factor of three here, but a factor of three, or a multiple of
three, is optimal because it allows other parts of the optimizer
to generate the code shown on FIG. 12. The selection of an
unroll factor here of three or a multiple of three is important
because three values of the array must be kept, A[I], A[I-1]
and A[I-2], if the array references are to be avoided.

Three variables T, T1 and T2 are used, making it possible
for other well known code optimization techniques to
generate code, such as that shown on FIG. 12. This eliminates
shuttling the temporary data from T1 to T. This is an
example of where the nature of the code in the loop and its
effect on scalar optimization forces a particular unroll factor.
In general, one lists the various indexes in the loops and sorts
them to notice the maximum distance between them.
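
A minimal sketch of why a factor of three removes the moves is given below, in Python. FIG. 12 is not reproduced in this text, so the loop body `A[i] = A[i-2] + B[i]`, the array `B` and the function names are assumptions. With three copies of the body, the three temporaries exchange roles from one copy to the next, so no copy from one temporary to another is needed inside the unrolled loop. The remainder loop at the end is an ordinary compensation loop and still uses the moves.

```python
def recurrence_reference(a, b):
    """Plain loop used only to check the unrolled version."""
    a = list(a)
    for i in range(2, len(a)):
        a[i] = a[i - 2] + b[i]
    return a


def recurrence_unrolled_by_three(a, b):
    """Unroll by three; t, t1 and t2 rotate roles instead of being moved."""
    a = list(a)
    n = len(a)
    if n < 2:
        return a
    t, t1 = a[0], a[1]          # t = A[i-2], t1 = A[i-1] on loop entry
    i = 2
    while i + 3 <= n:
        t2 = t + b[i]           # A[i]   = A[i-2] + B[i]
        a[i] = t2
        t = t1 + b[i + 1]       # A[i+1] = A[i-1] + B[i+1]; old t is dead
        a[i + 1] = t
        t1 = t2 + b[i + 2]      # A[i+2] = A[i]   + B[i+2]; old t1 is dead
        a[i + 2] = t1
        i += 3                  # now t = A[i-2] and t1 = A[i-1] again
    while i < n:                # leftover iterations, with explicit moves
        value = t + b[i]
        a[i] = value
        t, t1 = t1, value
        i += 1
    return a
```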

In the above case the distance between the index I and
I-2 is 2. Adding one to this value computes a primary unroll
factor, which is three in this case. This is an acceptable
unroll factor. However, if it turned out to be very small, one
might want to multiply it by a constant to get a larger unroll
factor. Alternatively, if the primary unroll factor was very
large, one might want to divide it by the loop increment if
it was a number other than one. The reasons for selecting a
particular unroll factor are discussed below.
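
The arithmetic of the primary unroll factor can be written out as a short Python sketch. The function name and the representation of each array reference by its constant offset from the index (0 for A[I], -2 for A[I-2]) are assumptions made for illustration.

```python
def primary_unroll_factor(offsets):
    """Maximum distance between the index offsets, plus one.

    `offsets` holds the constant offsets of the references to one array,
    for example [0, -2] for a loop that touches A[I] and A[I-2].
    """
    ordered = sorted(offsets)              # list and sort the indexes
    distance = ordered[-1] - ordered[0]    # maximum distance between them
    return distance + 1
```

For the loop discussed above, `primary_unroll_factor([0, -2])` gives 3.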

Determining the unroll factor. 

Classically the prior art uses a standard unroll factor for 
all loops. Typically the number used is four. In the invention, 
the unroll factor is calculated for each loop depending on 
various factors. At one extreme some loops are not unrolled 
at all, and other loops are unrolled eight or even more times. 
The disadvantages of picking too large an unroll factor are: 

1. The loop instructions may not all fit into the instruction
cache, leading to many I-cache misses.

2. The higher the unroll factor, the higher the memory 
stride of memory instructions across iterations. If the 
memory stride exceeds cache line size, the effective- 
ness of prefetch decreases. 

3. The resulting code is longer. Usually an upper bound
must be chosen; an unroll count of 1000 is not likely to
be a good idea since the compile time can go up
significantly. Also, excessive unrolling can adversely
affect other optimizations which have bounds on the
number of transformations they can make.

On the other hand if a small unroll factor is chosen the 
following problems can occur: 

1. Much more time is spent executing the high latency 
branch instruction which closes the loop. 
2. The short inner loop provides many fewer opportunities
for optimization than longer inner loops. Where the
inner loop has high latency instructions, the compiler
can often have them execute in parallel with low
latency time instructions. This may not be possible in
very short loops.
One must keep in mind that the compiler is compiling
loops that range from single instruction inner loops to loops
that have scores or even hundreds of instructions, and so the
compiler must balance these considerations to
achieve good unroll factors. To determine the unroll factor,
the compiler considers the following in decreasing order of
importance:

1. There is a maximum value of the unroll factor (which
in the preferred embodiment of the invention is eight).

2. The number of instructions in the unrolled loop must 
not exceed a specific limit. This provides another upper 
bound to the unroll factor. 

3. If there are references to previous indexed contents of
the array such as was shown in FIGS. 10 through 12, an
unroll factor suggested by this analysis (or a multiple of
it) should be used.

4. If prefetch instructions are being generated (this is 
known based on a user defined flag), then try to pick an 
unroll factor that keeps the value of the strides of array 
references within the loop below the cache line size. 

5. If the trip count is a constant known at compile time,
then an unroll factor that eliminates the need for a
compensation code loop should be selected. Typically
this would be an unroll factor of 2, 4 or 8, although
other numbers such as 3 or 5 might be possible.

6. If there is profile information, use that. If the profile
information says that the loop iterates on average
k times, then if k is smaller than the maximum value of the
unroll factor as dictated by the previous steps, use k;
otherwise use the maximum value of the unroll factor.

7. If there are high latency operations within the loop such
as divide and square root operations, use an unroll
factor that will enhance the maximum overlap of these
instructions. For instance, if the architecture has two
divide units and the loop has a single divide instruction,
the loop should be unrolled an even number of times so
that both the divide units can be kept simultaneously
busy.

The algorithm that computes the unroll factor tries to 
compute an optimal and acceptable unroll factor. The cost of 
a nonoptimal unroll factor is slower run time code. As 
discussed above, the algorithm is sensitive to profile data,
number of instructions in the loop, architecture features like
functional units and cache line size, interactions with other 
optimizations and constant trip counts. 
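
One way to read the ordered considerations above is as a series of filters over candidate unroll factors, applied from most to least important. The Python sketch below follows that reading. It is an illustration under stated assumptions, not the algorithm of the preferred embodiment: the function name, every parameter name, the default instruction limit of 64 and the rule that a later consideration is skipped when it would leave no candidate are all hypothetical, and consideration 7 (functional units) is left out.

```python
def choose_unroll_factor(body_instructions,
                         max_factor=8,
                         max_unrolled_instructions=64,
                         recurrence_factor=None,
                         stride_bytes=None,
                         cache_line_bytes=None,
                         trip_count=None,
                         average_trips=None):
    """Pick an unroll factor; 1 means the loop is not unrolled."""
    # Considerations 1 and 2: hard upper bounds on the factor.
    limit = min(max_factor, max_unrolled_instructions // body_instructions)
    if limit < 2:
        return 1
    candidates = list(range(2, limit + 1))

    def narrow(keep):
        # Apply a preference only if at least one candidate survives it,
        # so that more important considerations are never overridden.
        nonlocal candidates
        kept = [u for u in candidates if keep(u)]
        if kept:
            candidates = kept

    # Consideration 3: multiples of the factor suggested by recurrences.
    if recurrence_factor:
        narrow(lambda u: u % recurrence_factor == 0)
    # Consideration 4: keep the per-pass stride within one cache line.
    if stride_bytes and cache_line_bytes:
        narrow(lambda u: u * stride_bytes <= cache_line_bytes)
    # Consideration 5: a known trip count should need no compensation code.
    if trip_count is not None:
        narrow(lambda u: trip_count % u == 0)
    # Consideration 6: do not unroll beyond the profiled average trip count.
    if average_trips is not None:
        narrow(lambda u: u <= average_trips)
    return max(candidates)
```

With these assumed defaults, `choose_unroll_factor(4, recurrence_factor=3)` returns 6, the largest multiple of three that does not exceed eight.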

Attention is directed to FIG. 13, which shows how the
optimization algorithms presented here are implemented. At
201 the unroll factor is determined as described above. Next,
at 203, a check is made for the special case where the trip
count is known at compile time and is a multiple of the
unroll factor. In this case, the unrolled loop code is generated
from the original loop code, any middle exits are removed
leaving only the final exit, and the unrolled code is output at
242. Because this is an unrolled loop which needs no
compensation code, none is output. The other exit from
203 is taken where the trip count is not known at compile time, or the
trip count and unroll factor are such that compensation code
must be generated. Control goes to 205 where the unrolled
code is generated. For non-WHILE loops, the middle exits
are removed leaving only the final exit. For a WHILE loop,
the final exit is moved to the end of the loop (226) instead
of the middle and all other exits removed. At this time a
determination (at 229) is made using the unroll factor to
determine if the compensation code should be output before
or after the unrolled loop code. If it should be after, control
goes to 244, otherwise control goes to 246. All of these three
control paths then meet at 999, terminating the unrolling
optimization.
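
The shape of the generated code, with the compensation code placed after the unrolled loop, can be sketched in Python as follows. The figures are not reproduced in this text, so this is one reading of the description. The unroll factor `U = 4`, the summation body, the trace tuples and all function names are assumptions. In both functions the unrolled loop stops as soon as fewer than U + 1 trips remain, so the compensation code always runs at least once.

```python
U = 4  # unroll factor assumed for this sketch


def total_unrolled(values):
    """Counted loop; assumes the original loop runs at least once."""
    n = len(values)
    if n == 0:
        raise ValueError("sketch assumes a trip count of at least one")
    total = 0
    i = 0
    while n - i >= U + 1:          # exit test uses U + 1, not U
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
        i += U
    while True:                    # compensation loop, 1 to U trips,
        total += values[i]         # entered without a zero-trip test
        i += 1
        if i >= n:
            break
    return total


def trace_mid_exit(n):
    """Loop with its exit test in the middle (a WHILE loop in the text)."""
    trace = []
    i = 0
    while True:
        trace.append(("pre", i))   # work before the exit test
        if i >= n:
            break
        trace.append(("post", i))  # work after the exit test
        i += 1
    return trace


def trace_mid_exit_unrolled(n):
    """Same loop; the unrolled part carries no middle exit tests."""
    trace = []
    i = 0
    while (n + 1) - i >= U + 1:    # n + 1 trips in total, counting the last
        for k in range(U):         # stands for U straight-line body copies
            trace.append(("pre", i + k))
            trace.append(("post", i + k))
        i += U
    while True:                    # compensation code keeps the only exit
        trace.append(("pre", i))
        if i >= n:
            break
        trace.append(("post", i))
        i += 1
    return trace
```

Comparing `trace_mid_exit_unrolled(n)` with `trace_mid_exit(n)` for a range of `n` is a quick way to check that removing the middle exits from the unrolled body does not change the order of operations.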

FIG. 14 is a block schematic diagram of a compiler for a
programmable machine in accordance with the invention.
The compiler of FIG. 1b shows a loop unrolling module 30.
The preferred embodiment of the invention provides a loop
unrolling module that is placed within the compiler as
shown in FIG. 2b. As shown in FIG. 14, the compiler
comprises an analysis module 1301 for analyzing and
unrolling loops within source applications. An optimizer
module 1302 determines an optimum unroll factor in
response to the analysis module. An unroll module 1303
generates an unrolled loop having said optimum unroll
factor, while a compensation module 1304 generates and
places any compensation code as required as a result of loop
unroll optimization.

Although the invention is described herein with reference 
to the preferred embodiment, one skilled in the art will 
readily appreciate that other applications may be substituted 
for those set forth herein without departing from the spirit 
and scope of the present invention. Accordingly, the inven- 
tion should only be limited by the claims included below. 

We claim: 

1. In a programmable machine, a compiler comprising:
an analysis module for analyzing and unrolling loops
within source applications;
an optimizer module for determining an optimum unroll
factor in response to said analysis module;
an unroll module for generating an unrolled loop having
said optimum unroll factor; and
a compensation module for generating and placing any
compensation code as required as a result of loop unroll
optimization,
wherein said compensation module performs a placement
calculation to determine whether to put said
compensation code before the unrolled loop or after the
unrolled loop.

2. The compiler of claim 1, wherein said compensation 
module ensures that said compensation code is executed at 
least once when said unrolled loop is executed. 

3. The compiler of claim 1, wherein said unroll factor is
responsive to a number of instructions in the loop.

4. The compiler of claim 1, wherein said unroll factor is
responsive to resource usage within the loop.

5. The compiler of claim 1, wherein said unroll factor is
responsive to prefetch distance of memory references within
the loop.

6. The compiler of claim 1, wherein said unroll factor is
responsive to profile information collected from previous
executions of compiled code.

7. The compiler of claim 1, wherein said unroll factor is
responsive to recurrence of memory references within the
loop.

8. The compiler of claim 1, wherein said unroll factor is
responsive to the number of instructions in the loop.

9. The compiler of claim 1, wherein a trip count for the
loop is known at compile time, and wherein said optimizer
module determines an unroll factor that executes the
compensation code zero times and suppresses the generation of
said compensation code.

10. The compiler of claim 1, wherein said placement
calculation is responsive to the unroll factor computed for
the loop.

11. The compiler of claim 1, wherein said loop is a loop
having an early exit.

12. The compiler of claim 1, wherein loop unrolling is
integrated with other low-level optimization phases.

13. The compiler of claim 12, wherein said other
low-level optimization phases include any of prefetch instruction
insertion, register reassociation, and instruction scheduling.

14. A method for unrolling loops, comprising the steps of:
determining an unroll factor;
generating an unrolled loop which always exits leaving a
remaining trip count of at least one;
generating compensation code;
determining whether the compensation code should be
placed before the unrolled loop or after the unrolled
loop; and
determining if the trip count is a power of two; and if it
is, placing the compensation code before the unrolled
loop.

15. The method of claim 14, wherein the loop to be
unrolled is a loop having an early exit.

16. The method of claim 15, wherein the loop having an
early exit after unrolling is transformed into a loop having an
exit at its end, and from which all intermediate exits have
been removed.

***** 